Skip to content

Conversation

@ZiyiTsang
Copy link
Collaborator

This is the implementation to paper MAPO.

It doesn't require much code refactoring, still in the process.....

More "reay-to-go" algorithms and examples are the key to making a repo popular, especially RL-related.

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

@ZiyiTsang

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

@ZiyiTsang

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

@ZiyiTsang

This comment was marked as resolved.

@ZiyiTsang ZiyiTsang requested a review from Copilot September 28, 2025 18:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@ZiyiTsang
Copy link
Collaborator Author

Ready for review.

# since the advantages is same within same trajectory, we can get the trajectory_level advantage from first token
# base on assumption that the advantage on last dim are totally same

advantages_ = advantages[:, 0] # advantages shape [batch_size*group_size]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line does not take any effect and should be removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, useful. Please see the code comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The adv of first token is extract and use for below logic.

Comment on lines +1410 to +1412
return (
1 - trajectory_reweight
) * deviation_base_norm + trajectory_reweight * mean_base_norm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double-check the formula. Since the trajectory_weight is computed as 4p(1-p) rather than 1-4p(1-p), should we reverse the weighting of these two norms?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my mistake. thank you

Copy link
Collaborator Author

@ZiyiTsang ZiyiTsang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready for second review.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

@github-actions github-actions bot added the stale label Oct 29, 2025
@github-actions github-actions bot removed the stale label Nov 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants